Abstract. In this paper we obtain non-uniform exponential upper bounds
for the rate of convergence of a version of the algorithm Context, when the
underlying tree is not necessarily bounded. The algorithm Context is a well- 
known tool to estimate the context tree of a Variable Length Markov Chain. 
As a consequence of the exponential bounds we obtain a strong consistency 
result. We generalize in this way several previous results in the field. 
1. Introduction

In this paper we present an exponential bound for the rate of convergence of 
the algorithm Context for a class of unbounded variable memory models, taking 
values on a finite alphabet A. A strong consistency result for the
algorithm Context in this setting follows from this bound. Variable memory models were first introduced
in the information theory literature by Rissanen [11] as a universal system for data
compression. Originally called by Rissanen finite memory source or probabilistic
tree, this class of models recently became popular in the statistics literature under
the name of Variable Length Markov Chains (VLMC) [1].

The idea behind the notion of variable memory models is that the probabilis- 
tic definition of each symbol only depends on a finite part of the past and the length 
of this relevant portion is a function of the past itself. Following Rissanen we call 
context the minimal relevant part of each past. The set of all contexts satisfies the
suffix property which means that no context is a proper suffix of another context.
This property allows us to represent the set of all contexts as a rooted labeled tree.
With this representation the process is described by the tree of all contexts and an
associated family of probability measures on A, indexed by the tree of contexts.
Given a context, its associated probability measure gives the probability of the
next symbol for any past having this context as a suffix. From now on the pair
composed of the context tree and the associated family of probability measures
will be called a probabilistic context tree.

Rissanen not only introduced the notion of variable memory models but he 
also introduced the algorithm Context to estimate the probabilistic context tree. 
The way the algorithm Context works can be summarized as follows. Given a 
sample produced by a chain with variable memory, we start with a maximal tree 
of candidate contexts for the sample. The branches of this first tree are then 
pruned until we obtain a minimal tree of contexts well adapted to the sample. 
We associate to each context an estimated probability transition defined as the 
proportion of time the context appears in the sample followed by each one of 
the symbols in the alphabet. From Rissanen [11] to Galves et al. [10], passing by
Ron et al. [12] and Bühlmann and Wyner [1], several variants of the algorithm
Context have been presented in the literature. In all the variants the decision to 
prune a branch is taken by considering a cost function. A branch is pruned if 
the cost function assumes a value smaller than a given threshold. The estimated 
context tree is the smallest tree satisfying this condition. The estimated family of 
probability transitions is the one associated to the minimal tree of contexts. 

In his seminal paper Rissanen proved the weak consistency of the algorithm 
Context in the case where the contexts have a bounded length, i.e. where the tree 
of contexts is finite. Bühlmann and Wyner [1] proved the weak consistency of the
algorithm also in the finite case without assuming an a priori known bound on the
maximal length of the memory, but using a bound allowed to grow with the size of 
the sample. In both papers the cost function is defined using the log likelihood ratio 
test to compare two candidate trees and the main ingredient of the consistency 
proofs was the chi-square approximation to the log likelihood ratio test for Markov 
chains of fixed order. A different way to prove the consistency in the finite case 
was introduced in [10], using exponential inequalities for the estimated transition 
probabilities associated to the candidate contexts. As a consequence they obtain 
an exponential upper bound for the rate of convergence of their variant of the 
algorithm Context. 

The unbounded case as far as we know was first considered by Ferrari and 
Wyner [8] who also proved a weak consistency result for the algorithm Context in 
this more general setting. The unbounded case was also considered by Csiszár and
Talata [3] who introduced a different approach for the estimation of the probabilistic
context tree using the Bayesian Information Criterion (BIC) as well as the
Minimum Description Length Principle (MDL). We refer the reader to this last
paper for a nice description of other approaches and results in this field, including
the context tree maximizing algorithm by Willems et al. [14]. With the exception of
Weinberger et al. [13], the issue of the rate of convergence of the algorithm estimating
the probabilistic context tree was not addressed in the literature until
recently. Weinberger et al. proved in the bounded case that the probability that
the estimated tree differs from the finite context tree generating the sample is
summable as a function of the sample size. Duarte et al. [6] extend the original
weak consistency result by Rissanen [11] to the unbounded case. Assuming
weaker hypotheses than [8], they showed that the on-line estimation of the context
function decreases as the inverse of the sample size. 

In the present paper we generalize the exponential inequality approach pre- 
sented in [10] to obtain an exponential upper bound for the algorithm Context 
in the case of unbounded probabilistic context trees. Under suitable conditions, 
we prove that the truncated estimated context tree converges exponentially fast 
to the tree generating the sample, truncated at the same level. This improves all 
results known until now. 

The paper is organized as follows. In section 2 we give the definitions and 
state the main results. Section 3 is devoted to the proof of an exponential bound 
for conditional probabilities, for unbounded probabilistic context trees. In section 4 
we apply this exponential bound to estimate the rate of convergence of our version 
of the algorithm Context and to prove its consistency. 

2. Definitions and results 

In what follows A will represent a finite alphabet of size $|A|$. Given two integers
$m \le n$, we will denote by $w_m^n$ the sequence $(w_m, \ldots, w_n)$ of symbols in A. The
length of the sequence $w_m^n$ is denoted by $\ell(w_m^n)$ and is defined by $\ell(w_m^n) = n - m + 1$.
Any sequence $w_m^n$ with $m > n$ represents the empty string and is denoted by $\lambda$.
The length of the empty string is $\ell(\lambda) = 0$.

Given two finite sequences $w$ and $v$, we will denote by $vw$ the sequence
of length $\ell(v) + \ell(w)$ obtained by concatenating the two strings. In particular,
$\lambda w = w \lambda = w$. The concatenation of sequences is also extended to the case in
which $v$ denotes a semi-infinite sequence, that is $v = v_{-\infty}^{-1}$.

We say that the sequence $s$ is a suffix of the sequence $w$ if there exists a
sequence $u$, with $\ell(u) \ge 1$, such that $w = us$. In this case we write $s \prec w$. When
$s \prec w$ or $s = w$ we write $s \preceq w$. Given a sequence $w$ we denote by suf(w) the
largest suffix of $w$.
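
As a small illustration of this notation, the following Python sketch represents finite sequences as strings over A, with the most recent symbol written last. The function names `is_proper_suffix`, `is_suffix` and `suf` are ours and are introduced only for this example.

```python
def is_proper_suffix(s: str, w: str) -> bool:
    """Return True when s ≺ w, i.e. w = u s for some u with len(u) >= 1."""
    return len(s) < len(w) and w.endswith(s)


def is_suffix(s: str, w: str) -> bool:
    """Return True when s ⪯ w (s is a proper suffix of w, or s equals w)."""
    return w.endswith(s)


def suf(w: str) -> str:
    """Largest proper suffix of w: drop the first (oldest) symbol."""
    if not w:
        raise ValueError("the empty string has no proper suffix")
    return w[1:]
```

With this convention the empty string plays the role of $\lambda$, and it is a proper suffix of every non-empty sequence.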

In the sequel $A^j$ will denote the set of all sequences of length $j$ over A and
$A^*$ represents the set of all finite sequences, that is

$$A^* = \bigcup_{j=1}^{\infty} A^j.$$

Definition 2.1. A countable subset $T$ of $A^*$ is a tree if no sequence $s \in T$ is a suffix
of another sequence $w \in T$. This property is called the suffix property.

We define the height of the tree $T$ as

$$h(T) = \sup\{\ell(w) : w \in T\}.$$

In the case $h(T) < +\infty$ it follows that $T$ has a finite number of sequences.
In this case we say that $T$ is bounded and we will denote by $|T|$ the number of
sequences in $T$. On the other hand, if $h(T) = +\infty$ then $T$ has a countable number
of sequences. In this case we say that the tree $T$ is unbounded.

Given a tree $T$ and an integer $K$ we will denote by $T|_K$ the tree $T$ truncated
to level $K$, that is

$$T|_K = \{w \in T : \ell(w) \le K\} \cup \{w : \ell(w) = K \text{ and } w \prec u, \text{ for some } u \in T\}.$$

We will say that a tree is irreducible if no sequence can be replaced by a
suffix without violating the suffix property. This notion was introduced in [3] and
generalizes the concept of complete tree.
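
For a bounded tree these notions can be checked mechanically. The sketch below assumes a finite tree stored as a Python set of strings (most recent symbol last); the helper names and the example tree are hypothetical and serve only to illustrate the suffix property, the height and the truncation to level K.

```python
def has_suffix_property(tree: set[str]) -> bool:
    """Check that no element of the tree is a proper suffix of another one."""
    return not any(s != w and w.endswith(s) for s in tree for w in tree)


def height(tree: set[str]) -> int:
    """h(T): for a finite set the supremum of the lengths is the maximum."""
    return max((len(w) for w in tree), default=0)


def truncate(tree: set[str], K: int) -> set[str]:
    """T|_K: keep sequences of length <= K, cut longer ones to their last K symbols."""
    if K < 1:
        raise ValueError("K must be a positive integer")
    short = {w for w in tree if len(w) <= K}
    cut = {u[-K:] for u in tree if len(u) > K}
    return short | cut


# Hypothetical bounded tree on A = {0, 1}.
example_tree = {"1", "10", "100", "000"}
```

For this example, `truncate(example_tree, 2)` keeps "1" and "10" and replaces both "100" and "000" by their common suffix "00".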

Definition 2.2. A probabilistic context tree over A is an ordered pair (T,p) such 
that 

1. T is an irreducible tree; 

2. $p = \{p(\cdot|w) ; w \in T\}$ is a family of transition probabilities over A.

Consider a stationary stochastic chain $(X_t)_{t \in \mathbb{Z}}$ over A. Given a sequence
$w \in A^j$ we denote by

$$p(w) = \mathbb{P}(X_1^j = w)$$

the stationary probability of the cylinder defined by the sequence $w$. If $p(w) > 0$
we write

$$p(a|w) = \mathbb{P}(X_0 = a \mid X_{-j}^{-1} = w).$$

Definition 2.3. A sequence $w \in A^j$ is a context for the process $(X_t)$ if $p(w) > 0$
and for any semi-infinite sequence $x_{-\infty}^{-1}$ such that $w$ is a suffix of $x_{-\infty}^{-1}$ we have
that

$$\mathbb{P}(X_0 = a \mid X_{-\infty}^{-1} = x_{-\infty}^{-1}) = p(a|w), \quad \text{for all } a \in A,$$

and no suffix of $w$ satisfies this equation.

Definition 2.4. We say that the process $(X_t)$ is compatible with the probabilistic
context tree $(T, p)$ if the following conditions are satisfied:

1. $w \in T$ if and only if $w$ is a context for the process $(X_t)$.

2. For any $w \in T$ and any $a \in A$, $p(a|w) = \mathbb{P}(X_0 = a \mid X_{-\ell(w)}^{-1} = w)$.
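
To make the role of the contexts concrete, the following Python sketch generates symbols from a bounded probabilistic context tree: at each step it looks up the unique context that is a suffix of the current past and draws the next symbol from the associated distribution. The tree `EXAMPLE_TREE`, its probabilities and the function names are hypothetical. The chain is started from a fixed past, so the output is only approximately a sample of the stationary chain.

```python
import random

# Hypothetical probabilistic context tree on A = {0, 1}: context -> p(.|context).
# Contexts are written with the most recent symbol last.
EXAMPLE_TREE = {
    "1": {"0": 0.7, "1": 0.3},
    "10": {"0": 0.4, "1": 0.6},
    "100": {"0": 0.5, "1": 0.5},
    "000": {"0": 0.2, "1": 0.8},
}


def find_context(past: str, tree: dict) -> str:
    """Return the unique element of the tree that is a suffix of `past`."""
    matches = [w for w in tree if past.endswith(w)]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one context for past {past!r}")
    return matches[0]


def simulate(tree: dict, initial_past: str, n: int, seed: int = 0) -> str:
    """Draw n symbols, each from p(.|context of the current past)."""
    rng = random.Random(seed)
    h = max(len(w) for w in tree)  # only the last h symbols can matter
    past = initial_past[-h:]
    out = []
    for _ in range(n):
        probs = tree[find_context(past, tree)]
        symbols = sorted(probs)
        a = rng.choices(symbols, weights=[probs[s] for s in symbols])[0]
        out.append(a)
        past = (past + a)[-h:]
    return "".join(out)
```

For instance, `simulate(EXAMPLE_TREE, "000", 1000, seed=1)` returns a string of 1000 symbols that can be fed to the estimators discussed in this section.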

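Define the sequence $(\alpha_k)_{k \in \mathbb{N}}$ as

$$\alpha_0 := \sum_{a \in A} \inf_{w \in T} \{p(a|w)\},$$

$$\alpha_k := \inf_{u \in A^k} \sum_{a \in A} \inf_{w \in T,\, w \succeq u} \{p(a|w)\}.$$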



From now on we will assume that the probabilistic context tree (T, p) satisfies 
the following assumptions. 
Assumption 2.5. Non-nullness, that is $\inf_{w \in T} \{p(a|w)\} > 0$ for any $a \in A$.

Assumption 2.6. Summability of the sequence $(1 - \alpha_k)$, $k \ge 0$. In this case denote
by

$$\alpha := \sum_{k \in \mathbb{N}} (1 - \alpha_k) < +\infty.$$

For a probabilistic context tree satisfying Assumptions 2.5 and 2.6, the maximal
coupling argument used in [7], or alternatively the perfect simulation scheme
presented in [2], implies the uniqueness of the law of the chain compatible with it.

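Given an integer $k \ge 1$, we define

$$C_k = \{u \in T|_k : p(a|u) \ne p(a|\mathrm{suf}(u)) \text{ for some } a \in A\}$$

and

$$D_k = \min_{u \in C_k} \max_{a \in A} \{|p(a|u) - p(a|\mathrm{suf}(u))|\}.$$

We denote by

$$\epsilon_k = \min\{p(w) : \ell(w) \le k \text{ and } p(w) > 0\}.$$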

In what follows we will assume that $x_0, x_1, \ldots, x_{n-1}$ is a sample of the stationary
stochastic chain $(X_t)$ compatible with the probabilistic context tree $(T, p)$.

For any finite string $w$ with $\ell(w) \le n$, we denote by $N_n(w)$ the number of
occurrences of the string in the sample; that is

$$N_n(w) = \sum_{t=0}^{n-\ell(w)} \mathbf{1}\{x_t^{t+\ell(w)-1} = w\}.$$

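For any element $a \in A$, the empirical transition probability $\hat{p}_n(a|w)$ is defined
by

$$\hat{p}_n(a|w) = \frac{N_n(wa) + 1}{N_n(w\cdot) + |A|}, \tag{2.7}$$

where

$$N_n(w\cdot) = \sum_{b \in A} N_n(wb).$$

This definition of $\hat{p}_n(a|w)$ is convenient because it is asymptotically equivalent
to $\frac{N_n(wa)}{N_n(w\cdot)}$ and it avoids an extra definition in the case $N_n(w\cdot) = 0$.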
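
The counts and the smoothed estimator can be computed directly from a sample. The sketch below assumes the sample is a Python string and the alphabet a string of distinct symbols; the function names `count` and `p_hat` are ours. It implements the ratio $(N_n(wa)+1)/(N_n(w\cdot)+|A|)$ described above.

```python
def count(sample: str, w: str) -> int:
    """N_n(w): number of (possibly overlapping) occurrences of w in the sample."""
    L = len(w)
    return sum(1 for t in range(len(sample) - L + 1) if sample[t:t + L] == w)


def p_hat(sample: str, a: str, w: str, alphabet: str) -> float:
    """Empirical transition probability (N_n(wa) + 1) / (N_n(w.) + |A|)."""
    n_w_dot = sum(count(sample, w + b) for b in alphabet)
    return (count(sample, w + a) + 1) / (n_w_dot + len(alphabet))
```

When $w$ is never followed by any symbol in the sample, `p_hat` returns the uniform value $1/|A|$, which is the case that the added constants are meant to cover.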

A variant of Rissanen's algorithm Context is defined as follows. First of all,
let us define for any finite string $w \in A^*$:

$$\Delta_n(w) = \max_{a \in A} |\hat{p}_n(a|w) - \hat{p}_n(a|\mathrm{suf}(w))|.$$

The $\Delta_n(w)$ operator computes a distance between the empirical transition probabilities
associated with the sequence $w$ and the ones associated with the sequence suf(w).
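
A possible way to evaluate this distance on data is sketched below. It precomputes all substring counts up to a chosen length with `collections.Counter` and then compares the smoothed empirical transition probabilities of $w$ and suf(w). The function names are hypothetical, and `max_len` must be at least $\ell(w) + 1$ for the counts of $wa$ to be available.

```python
from collections import Counter


def substring_counts(sample: str, max_len: int) -> Counter:
    """Counts of every substring of length 1..max_len occurring in the sample."""
    counts = Counter()
    for L in range(1, max_len + 1):
        for t in range(len(sample) - L + 1):
            counts[sample[t:t + L]] += 1
    return counts


def p_hat(counts: Counter, a: str, w: str, alphabet: str) -> float:
    """Smoothed estimate (N_n(wa) + 1) / (N_n(w.) + |A|); missing counts are 0."""
    total = sum(counts[w + b] for b in alphabet)
    return (counts[w + a] + 1) / (total + len(alphabet))


def delta_n(counts: Counter, w: str, alphabet: str) -> float:
    """Largest gap over a in A between p_hat(a|w) and p_hat(a|suf(w))."""
    if not w:
        raise ValueError("w must be non-empty so that suf(w) exists")
    parent = w[1:]  # suf(w): drop the oldest symbol
    return max(
        abs(p_hat(counts, a, w, alphabet) - p_hat(counts, a, parent, alphabet))
        for a in alphabet
    )
```

A large value of `delta_n` suggests that the extra symbol in $w$ carries information about the next symbol, while a small value suggests that suf(w) already suffices.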
Definition 2.8. Given $\delta > 0$ and $d < n$, the tree estimated with the algorithm
Context is

$$\hat{T}_n^{\delta,d} = \{w \in A_1^d : N_n(aw) > 0,\ \Delta_n(a\,\mathrm{suf}(w)) > \delta \text{ for some } a \in A \text{ and } \Delta_n(uw) \le \delta \text{ for all } u \in A_1^{d-\ell(w)} \text{ with } N_n(uw\cdot) \ge 1\},$$

where $A_1^r$ denotes the set of all sequences of length at most $r$. In the case $\ell(w) = d$
we have $A_1^{d-\ell(w)} = \emptyset$.

It is easy to see that $\hat{T}_n^{\delta,d}$ is an irreducible tree. Moreover, the way we defined
$\hat{p}_n(\cdot|\cdot)$ in (2.7) associates a probability distribution to each sequence in $\hat{T}_n^{\delta,d}$.
The main result in this article is the following.

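Theorem 2.9. Let $(T, p)$ be a probabilistic context tree satisfying Assumptions 2.5
and 2.6 and let $(X_t)$ be a stationary stochastic chain compatible with $(T, p)$. Then
for any integer $K$, any $d$ satisfying

$$d > \max_{u \notin T,\ \ell(u) < K} \min\{k : \exists\, w \in C_k,\ w \succ u\}, \tag{2.10}$$

any $\delta < D_d$ and any

$$n > \frac{2(|A| + 1)}{\min(\delta, D_d - \delta)\,\epsilon_d} + d$$

we have that

$$\mathbb{P}(\hat{T}_n^{\delta,d}|_K \ne T|_K) \le 4 e^{1/e} |A|^{d+2} \exp\Big[-(n-d)\, \frac{\big[\min\big(\frac{\delta}{2}, \frac{D_d - \delta}{2}\big) - \frac{|A|+1}{(n-d)\epsilon_d}\big]^2 \epsilon_d^2\, C}{4 |A|^2 (d+1)}\Big],$$

where

$$C = \frac{\alpha_0}{8e(\alpha + \alpha_0)}.$$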

As a consequence we obtain the following strong consistency result. 

Corollary 2.11. Under the conditions of Theorem 2.9 we have

$$\hat{T}_n^{\delta,d}|_K = T|_K,$$

eventually almost surely as $n \to +\infty$.
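
To get a feeling for the size of the constants, the following Python sketch evaluates an upper bound of the form $4 e^{1/e} |A|^{d+2} \exp[-(n-d)\,g^2 \epsilon_d^2 C / (4|A|^2(d+1))]$, with $g = \min(\delta, D_d-\delta)/2 - (|A|+1)/((n-d)\epsilon_d)$ and $C = \alpha_0/(8e(\alpha+\alpha_0))$, which is how we read the bound obtained at the end of the proof of Theorem 2.9 in Section 4. The function name and all numerical values used with it are hypothetical; in practice $D_d$, $\epsilon_d$, $\alpha$ and $\alpha_0$ depend on the unknown process.

```python
import math


def theorem_bound(n, d, delta, D_d, eps_d, alpha, alpha0, alphabet_size):
    """Evaluate the right-hand side of the bound; all inputs are assumed known."""
    if not 0 < delta < D_d:
        raise ValueError("delta must lie strictly between 0 and D_d")
    margin = min(delta, D_d - delta)
    # Smallest sample size for which the bound is stated.
    n_min = 2 * (alphabet_size + 1) / (margin * eps_d) + d
    if n <= n_min:
        raise ValueError(f"the bound needs n > {n_min:.1f}")
    C = alpha0 / (8 * math.e * (alpha + alpha0))
    gap = margin / 2 - (alphabet_size + 1) / ((n - d) * eps_d)
    exponent = -(n - d) * gap**2 * eps_d**2 * C / (4 * alphabet_size**2 * (d + 1))
    prefactor = 4 * math.exp(1 / math.e) * alphabet_size ** (d + 2)
    return prefactor * math.exp(exponent)
```

With the hypothetical values $|A| = 2$, $d = 2$, $\delta = 0.1$, $D_d = 0.2$, $\epsilon_d = 0.1$, $\alpha = 1$ and $\alpha_0 = 0.5$, the right-hand side is still larger than 1 for $n = 10^6$ and falls below 1 only for much larger samples, so the bound is informative only for large $n$.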



3. Exponential inequalities for empirical probabilities 

The main ingredient in the proof of Theorem 2.9 is the following exponential upper 
bound 

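Theorem 3.1. For any finite sequence $w$, any symbol $a \in A$ and any $t > 0$ the
following inequality holds

$$\mathbb{P}(|N_n(wa) - (n - \ell(w))\,p(wa)| > t) \le e^{1/e} \exp\Big[\frac{-t^2\, C}{(n - \ell(w))\,\ell(wa)}\Big],$$

where

$$C = \frac{\alpha_0}{8e(\alpha + \alpha_0)}. \tag{3.2}$$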



As a direct consequence of Theorem 3.1 we obtain the following corollary. 

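Corollary 3.3. For any finite sequence $w$ with $p(w) > 0$, any symbol $a \in A$, any
$t > 0$ and any $n > \frac{|A|+1}{t\,p(w)} + \ell(w)$ the following inequality holds

$$\mathbb{P}(|\hat{p}_n(a|w) - p(a|w)| > t) \le 2|A|\, e^{1/e} \exp\Big[-(n-\ell(w))\, \frac{\big[t - \frac{|A|+1}{(n-\ell(w))p(w)}\big]^2 p(w)^2\, C}{4|A|^2\, \ell(wa)}\Big],$$

where $C$ is given by (3.2).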

To prove Theorem 3.1 we need a mixture property for processes compatible 
with a probabilistic context tree (T,p) satisfying Assumptions 2.5 and 2.6. This 
is the content of the following lemma. 

Lemma 3.4. Let $(X_t)$ be a stationary stochastic chain compatible with the probabilistic
context tree $(T, p)$ satisfying Assumptions 2.5 and 2.6. Then, there exists
a summable sequence $(\rho_l)_{l \in \mathbb{N}}$, satisfying

$$\sum_{l \in \mathbb{N}} \rho_l \le 1 + \frac{2\alpha}{\alpha_0}, \tag{3.5}$$

such that for any $i \ge 1$, any $k > i$, any $j \ge 1$ and any finite sequence $w_1^j$, the
following inequality holds

$$\sup_{x_1^i \in A^i} |\mathbb{P}(X_k^{k+j-1} = w_1^j \mid X_1^i = x_1^i) - p(w_1^j)| \le \sum_{l=0}^{j-1} \rho_{k-i-1+l}. \tag{3.6}$$

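Proof. First note that

$$\inf_{u \in A^\infty} \mathbb{P}(X_k^{k+j-1} = w_1^j \mid X_{-\infty}^{i} = u_{-\infty}^{0} x_1^i) \le \mathbb{P}(X_k^{k+j-1} = w_1^j \mid X_1^i = x_1^i) \le \sup_{u \in A^\infty} \mathbb{P}(X_k^{k+j-1} = w_1^j \mid X_{-\infty}^{i} = u_{-\infty}^{0} x_1^i),$$

where $A^\infty$ denotes the set of all semi-infinite sequences $u_{-\infty}^{0}$. The reader can find
a proof of the inequalities above in [7, Proposition 3]. Using this fact and the
condition of stationarity it is sufficient to prove that for any $k \ge 0$,

$$\sup_{x \in A^\infty} |\mathbb{P}(X_k^{k+j-1} = w_1^j \mid X_{-\infty}^{-1} = x_{-\infty}^{-1}) - p(w_1^j)| \le \sum_{l=0}^{j-1} \rho_{k+l}.$$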
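Note that for all pasts $x_{-\infty}^{-1}$ we have

$$\begin{aligned}
\big|\mathbb{P}(X_k^{k+j-1} = w_1^j \mid X_{-\infty}^{-1} = x_{-\infty}^{-1}) - p(w_1^j)\big|
&= \Big|\int_{u \in A^\infty} \big[\mathbb{P}(X_k^{k+j-1} = w_1^j \mid X_{-\infty}^{-1} = x_{-\infty}^{-1}) - \mathbb{P}(X_k^{k+j-1} = w_1^j \mid X_{-\infty}^{-1} = u_{-\infty}^{-1})\big]\, dp(u)\Big| \\
&\le \int_{u \in A^\infty} \big|\mathbb{P}(X_k^{k+j-1} = w_1^j \mid X_{-\infty}^{-1} = x_{-\infty}^{-1}) - \mathbb{P}(X_k^{k+j-1} = w_1^j \mid X_{-\infty}^{-1} = u_{-\infty}^{-1})\big|\, dp(u).
\end{aligned}$$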

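Therefore, applying the loss of memory property proved in [2, Corollary 4.1] we
have that

$$\big|\mathbb{P}(X_k^{k+j-1} = w_1^j \mid X_{-\infty}^{-1} = x_{-\infty}^{-1}) - \mathbb{P}(X_k^{k+j-1} = w_1^j \mid X_{-\infty}^{-1} = u_{-\infty}^{-1})\big| \le \sum_{l=0}^{j-1} \rho_{k+l},$$

where $\rho_m$ is defined as the probability of return to the origin at time $m$ of the
Markov chain on $\mathbb{N}$ starting at time zero at the origin and having transition probabilities

$$p(x,y) = \begin{cases} \alpha_x, & \text{if } y = x+1, \\ 1 - \alpha_x, & \text{if } y = 0, \\ 0, & \text{otherwise.} \end{cases} \tag{3.7}$$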



This concludes the proof of (3.6). To prove (3.5), let (Zn) be the Markov chain 
with probability transitions given by (3.7). By definition we have 



11(1 -pi) = n Y^^iZi = j I Zi_, = j - l)P(Z;_i = i - 1) 

1>1 1>1 j=l 

1-2 

1>1 1=0 (>0 

From this, using the inequality x < — ln(l — x) < which holds for any 

X e (— l,c], it follows that 

1-ai 



^Pi < -2^1ogaj < 2^ 



1>1 l>0 

This concludes the proof of the lemma. 
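
The return probabilities of the auxiliary Markov chain used in this proof are easy to compute numerically. The sketch below propagates the law of a chain on the non-negative integers that moves from x to x + 1 with probability $\alpha_x$ and jumps back to the origin otherwise, and records the probability of being at the origin at each time. The function name and the choice $\alpha_x = 1 - 2^{-(x+1)}$ are hypothetical and serve only as an example of a summable sequence $(1 - \alpha_x)$.

```python
def return_probabilities(alpha, m_max):
    """Return [rho_0, ..., rho_{m_max}], with rho_m = P(Z_m = 0 | Z_0 = 0)."""
    dist = [1.0]  # dist[x] = P(Z_m = x); the chain starts at the origin
    rho = [1.0]
    for _ in range(m_max):
        # Mass falling back to the origin from every state x.
        back = sum(p * (1.0 - alpha(x)) for x, p in enumerate(dist))
        # Mass moving one step up, from x to x + 1.
        dist = [back] + [p * alpha(x) for x, p in enumerate(dist)]
        rho.append(dist[0])
    return rho


def example_alpha(x):
    """Hypothetical sequence: alpha_0 = 1/2 and sum_x (1 - alpha_x) = 1."""
    return 1.0 - 0.5 ** (x + 1)
```

For this example $\alpha_0 = 1/2$ and $\alpha = 1$, so a bound of the form $1 + 2\alpha/\alpha_0$ equals 5, and the partial sums of the computed return probabilities can be compared with that value.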

We are now ready to prove Theorem 3.1.



Proof of Theorem 3.1. Let w be a finite sequence and a any symbol in A. Define 
the random variables 

$$U_j = \mathbf{1}\{X_j^{j+\ell(w)} = wa\} - p(wa),$$

for $j = 0, \ldots, n - \ell(wa)$. Then, using [4, Proposition 4] we have that, for any $p \ge 2$
\\Nr,{wa) - {n - e{w))p{wa)\\p 

n~i{wa) n—£(wa) ^ 



1=0 k=i 

n — i{wa) n~£(wa) 



^ E E 

i=0 k=i 

< (2peiwa){n-eiw)) 



sup \¥{x; 

2{a + ao)\i 
ao ) 



= wa I Xi^^'^""^ =u)-p{wa)\y 



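Then, as in [5, Proposition 5] we also obtain that, for any $t > 0$,

$$\mathbb{P}(|N_n(wa) - (n - \ell(w))\,p(wa)| > t) \le e^{1/e} \exp\Big[\frac{-t^2\, C}{(n - \ell(w))\,\ell(wa)}\Big],$$

where

$$C = \frac{\alpha_0}{8e(\alpha + \alpha_0)}.$$

□

Proof of Corollary 3.3. First observe that

$$\Big| p(a|w) - \frac{(n - \ell(w))\,p(wa) + 1}{(n - \ell(w))\,p(w) + |A|} \Big| \le \frac{|A| + 1}{(n - \ell(w))\,p(w)}.$$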

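Then, for all $n > \frac{|A|+1}{t\,p(w)} + \ell(w)$ we have that

$$\mathbb{P}(|\hat{p}_n(a|w) - p(a|w)| > t) \le \mathbb{P}\Big(\Big|\frac{N_n(wa) + 1}{N_n(w\cdot) + |A|} - \frac{(n - \ell(w))\,p(wa) + 1}{(n - \ell(w))\,p(w) + |A|}\Big| > t - \frac{|A| + 1}{(n - \ell(w))\,p(w)}\Big).$$

Denote by $t' = t - \frac{|A|+1}{(n - \ell(w))\,p(w)}$. Then

$$\begin{aligned}
\mathbb{P}\Big(\Big|\frac{N_n(wa) + 1}{N_n(w\cdot) + |A|} - \frac{(n - \ell(w))\,p(wa) + 1}{(n - \ell(w))\,p(w) + |A|}\Big| > t'\Big)
&\le \mathbb{P}\Big(|N_n(wa) - (n - \ell(w))\,p(wa)| > \frac{t'}{2}\,[(n - \ell(w))\,p(w) + |A|]\Big) \\
&\quad + \sum_{b \in A} \mathbb{P}\Big(|N_n(wb) - (n - \ell(w))\,p(wb)| > \frac{t'}{2|A|}\,[(n - \ell(w))\,p(w) + |A|]\Big).
\end{aligned}$$

Now, we can apply Theorem 3.1 to bound above the last sum by

$$2|A|\, e^{1/e} \exp\Big[-(n-\ell(w))\, \frac{\big[t - \frac{|A|+1}{(n-\ell(w))p(w)}\big]^2 p(w)^2\, C}{4|A|^2\, \ell(wa)}\Big],$$

where

$$C = \frac{\alpha_0}{8e(\alpha + \alpha_0)}.$$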

This finishes the proof of the corollary. □ 



4. Proof of the main results

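Proof of Theorem 2.9. Define

$$O_n^{K,\delta} = \bigcup_{w \in T,\ \ell(w) < K}\ \bigcup_{uw \in \hat{T}_n^{\delta,d}} \{\Delta_n(uw) > \delta\},$$

and

$$U_n^{K,\delta} = \bigcup_{w \in \hat{T}_n^{\delta,d},\ \ell(w) < K}\ \bigcap_{uw \in T|_d} \{\Delta_n(uw) \le \delta\}.$$

Then, if $d < n$ we have that

$$\{\hat{T}_n^{\delta,d}|_K \ne T|_K\} \subseteq O_n^{K,\delta} \cup U_n^{K,\delta}.$$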

The result follows from a succession of lemmas. 

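Lemma 4.1. For any $n > \frac{2(|A|+1)}{\delta\,\epsilon_d} + d$, for any $w \in T$ with $\ell(w) < K$ and for any
$uw \in \hat{T}_n^{\delta,d}$ we have that

$$\mathbb{P}(\Delta_n(uw) > \delta) \le 4|A|^2 e^{1/e} \exp\Big[-(n-d)\, \frac{\big[\frac{\delta}{2} - \frac{|A|+1}{(n-d)\epsilon_d}\big]^2 \epsilon_d^2\, C}{4|A|^2 (d+1)}\Big],$$

where $C$ is given by (3.2).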
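Proof. Recall that

$$\Delta_n(uw) = \max_{a \in A} |\hat{p}_n(a|uw) - \hat{p}_n(a|\mathrm{suf}(uw))|.$$

Note that the fact $w \in T$ implies that for any finite sequence $u$ with $p(u) > 0$ and
any symbol $a \in A$ we have $p(a|w) = p(a|uw)$. Hence,

$$\mathbb{P}(\Delta_n(uw) > \delta) \le \sum_{a \in A} \Big[\mathbb{P}\Big(|\hat{p}_n(a|w) - p(a|w)| > \frac{\delta}{2}\Big) + \mathbb{P}\Big(|\hat{p}_n(a|uw) - p(a|uw)| > \frac{\delta}{2}\Big)\Big].$$

Using Corollary 3.3 we can bound above the right hand side of the last inequality
by

$$4|A|^2 e^{1/e} \exp\Big[-(n-d)\, \frac{\big[\frac{\delta}{2} - \frac{|A|+1}{(n-d)\epsilon_d}\big]^2 \epsilon_d^2\, C}{4|A|^2 (d+1)}\Big],$$

where $C$ is given by (3.2). □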
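Lemma 4.2. For any $n > \frac{2(|A|+1)}{(D_d - \delta)\,\epsilon_d} + d$ and for any $w \in \hat{T}_n^{\delta,d}$ with $\ell(w) < K$ we
have that

$$\mathbb{P}\Big(\bigcap_{uw \in T|_d} \{\Delta_n(uw) \le \delta\}\Big) \le 4|A|\, e^{1/e} \exp\Big[-(n-d)\, \frac{\big[\frac{D_d - \delta}{2} - \frac{|A|+1}{(n-d)\epsilon_d}\big]^2 \epsilon_d^2\, C}{4|A|^2 (d+1)}\Big],$$

where $C$ is given by (3.2).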

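Proof. As $d$ satisfies (2.10) there exists $\bar{u}w \in T|_d$ such that $p(a|\bar{u}w) \ne p(a|\mathrm{suf}(\bar{u}w))$
for some $a \in A$. Then

$$\mathbb{P}\Big(\bigcap_{uw \in T|_d} \{\Delta_n(uw) \le \delta\}\Big) \le \mathbb{P}(\Delta_n(\bar{u}w) \le \delta).$$

Observe that for any $a \in A$,

$$|\hat{p}_n(a|\mathrm{suf}(\bar{u}w)) - \hat{p}_n(a|\bar{u}w)| \ge |p(a|\mathrm{suf}(\bar{u}w)) - p(a|\bar{u}w)| - |\hat{p}_n(a|\mathrm{suf}(\bar{u}w)) - p(a|\mathrm{suf}(\bar{u}w))| - |\hat{p}_n(a|\bar{u}w) - p(a|\bar{u}w)|.$$

Hence, we have that for any $a \in A$

$$\Delta_n(\bar{u}w) \ge D_d - |\hat{p}_n(a|\mathrm{suf}(\bar{u}w)) - p(a|\mathrm{suf}(\bar{u}w))| - |\hat{p}_n(a|\bar{u}w) - p(a|\bar{u}w)|.$$

Therefore,

$$\mathbb{P}(\Delta_n(\bar{u}w) \le \delta) \le \mathbb{P}\Big(\bigcap_{a \in A} \Big\{|\hat{p}_n(a|\mathrm{suf}(\bar{u}w)) - p(a|\mathrm{suf}(\bar{u}w))| \ge \frac{D_d - \delta}{2}\Big\}\Big) + \mathbb{P}\Big(\bigcap_{a \in A} \Big\{|\hat{p}_n(a|\bar{u}w) - p(a|\bar{u}w)| \ge \frac{D_d - \delta}{2}\Big\}\Big).$$

As $\delta < D_d$ and $n > \frac{2(|A|+1)}{(D_d - \delta)\,\epsilon_d} + d$ we can use Corollary 3.3 to bound above the
right hand side of this inequality by

$$4|A|\, e^{1/e} \exp\Big[-(n-d)\, \frac{\big[\frac{D_d - \delta}{2} - \frac{|A|+1}{(n-d)\epsilon_d}\big]^2 \epsilon_d^2\, C}{4|A|^2 (d+1)}\Big],$$

where $C$ is given by (3.2). This concludes the proof of the lemma. □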
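Now we can finish the proof of Theorem 2.9. We have that

$$\mathbb{P}(\hat{T}_n^{\delta,d}|_K \ne T|_K) \le \mathbb{P}(O_n^{K,\delta}) + \mathbb{P}(U_n^{K,\delta}).$$

Using the definition of $O_n^{K,\delta}$ and $U_n^{K,\delta}$ we have that

$$\mathbb{P}(\hat{T}_n^{\delta,d}|_K \ne T|_K) \le \sum_{w \in T,\ \ell(w) < K}\ \sum_{uw \in \hat{T}_n^{\delta,d}} \mathbb{P}(\Delta_n(uw) > \delta) + \sum_{w \in \hat{T}_n^{\delta,d},\ \ell(w) < K} \mathbb{P}\Big(\bigcap_{uw \in T|_d} \{\Delta_n(uw) \le \delta\}\Big).$$

Applying Lemma 4.1 and Lemma 4.2 we can bound above the last expression by

$$\mathbb{P}(\hat{T}_n^{\delta,d}|_K \ne T|_K) \le 4 e^{1/e} |A|^{d+2} \exp\Big[-(n-d)\, \frac{\big[\min\big(\frac{\delta}{2}, \frac{D_d - \delta}{2}\big) - \frac{|A|+1}{(n-d)\epsilon_d}\big]^2 \epsilon_d^2\, C}{4|A|^2 (d+1)}\Big],$$

where $C$ is given by (3.2). We conclude the proof of Theorem 2.9. □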
Proof of Corollary 2.11. It follows from Theorem 2.9, using the first Borel-Cantelli
Lemma and the fact that the bounds for the error estimation of the context tree
are summable in $n$ for a fixed $d$ satisfying (2.10) and $\delta < D_d$. □

5. Final remarks 

The present paper presents an upper bound for the rate of convergence of a version 
of the algorithm Context, for unbounded context trees. This generalizes previous 
results obtained in [10] for the case of bounded variable memory processes. We 
obtain an exponential bound for the probability of incorrect estimation of the 
truncated context tree, when the estimator is given by Definition 2.8. Note that
the definition of the context tree estimator depends on the parameter $\delta$, and this
parameter appears in the exponent of the upper bound. To assure the consistency
of the estimator we need to choose a $\delta$ sufficiently small, depending on the transition
probabilities of the process. Therefore, our estimator is not universal, in the
sense that for any fixed $\delta$ it fails to be consistent for any process having $D_d < \delta$.
The same happens with the parameter $d$. In order to choose $\delta$ and $d$ not depending
on the process, we can allow these parameters to be functions of $n$, in such a way that
$\delta_n$ goes to zero and $d_n$ goes to $+\infty$ as $n$ diverges. When we do this, we lose the
exponential property of the upper bound.

As an anonymous referee has pointed out, Finesso et al. [9] proved that in the 
simpler case of estimating the order of a Markov chain, it is not possible to obtain 
pure exponential bounds for the overestimation event with a universal estimator. 
The above discussion illustrates this fact. 